Excel File Extraction
This document explains the Excel file contact extraction pipeline powered by pandas integration. It covers how the system detects phone number and name columns in Excel files (.xlsx and .xls), processes pandas DataFrames, handles NaN values and data types, and manages encoding considerations. It also documents fallback mechanisms when pandas encounters corrupted Excel files and provides examples of supported formats, column naming variations, and common issues.
The Excel extraction feature is implemented in two primary locations:
A standalone CLI script that extracts contacts from CSV, TXT, and Excel files
A Flask API that accepts uploads, routes to the same extraction logic, and returns structured results
CLI extractor"] APP["app.py
Flask API"] REQ["requirements.txt
Dependencies"] VN["validate_number.py
Phone validator"] end subgraph "Local Prototypes" CLIF["cli_functions.py
Excel usage examples"] BETA["beta_pandas.py
DataFrame usage demo"] end EC --> VN APP --> EC REQ --> APP CLIF --> APP BETA --> APP
Diagram sources
Section sources
Phone number cleaning and normalization
Removes separators and validates digit count
Adds international prefix when applicable
Column detection algorithm
Keyword-based matching for phone and name columns
Fallback selection when keywords are absent
pandas DataFrame processing
Reading Excel files with pandas
Iterating rows and handling NaN values
Fallback mechanisms
Graceful handling of exceptions during Excel parsing
Minimal return when extraction fails
Key implementation references:
Phone cleaning: clean_phone_number
Excel extraction: extract_contacts_from_excel
Column detection: keyword matching loop
Row iteration and NaN handling: row processing
Section sources
The system integrates pandas for Excel parsing and applies a consistent column detection and phone cleaning workflow across formats.
Diagram sources
Excel Column Detection Algorithm#
The algorithm identifies phone and name columns using keyword matching against column names:
Phone keywords: [“phone”, “number”, “mobile”, “cell”, “tel”]
Name keywords: [“name”, “contact”, “person”]
Selection logic:
If any phone column is found, the first match is used
Otherwise, the first column is used as phone
If any name column is found, the first match is used
Otherwise, the second column is used as name (if available)
Diagram sources
Section sources
pandas DataFrame Processing Workflow#
Reading Excel files
Uses pandas to load .xlsx and .xls files
Internally relies on installed engines (openpyxl for .xlsx, xlrd for .xls)
Iterating rows
Iterates over DataFrame rows to extract values
Handling NaN values
Checks for numeric NaN types and skips invalid entries
Data type conversion
Converts values to string before stripping and cleaning
Encoding considerations
The extraction logic does not enforce encoding; pandas defaults apply
References:
DataFrame creation: pd.read_excel
Row iteration and NaN checks: row processing
Dependencies for engines: requirements.txt
Section sources
Automatic Phone Number and Name Column Identification#
Phone column identification
Keywords searched in lowercase column names
First matching column is selected; otherwise first column
Name column identification
Keywords searched in lowercase column names
First matching column is selected; otherwise second column if available
Fallback behavior
If no columns match, the algorithm falls back to first/second columns
References:
Keyword matching: phone and name detection
Section sources
Phone Number Cleaning and Validation#
Removes separators and non-digits except plus sign
Strips leading zeros when not international
Adds plus sign for international-like numbers
Validates digit count to ensure realistic lengths
References:
Cleaning logic: clean_phone_number
Section sources
Fallback Mechanisms for Corrupted Excel Files#
The Excel extraction function wraps pandas loading in a try-except block
On failure, the function returns an empty list without raising errors
This prevents API crashes and allows graceful degradation
References:
Exception handling: try-except around pd.read_excel
Section sources
Supported Excel Formats and Column Naming Variations#
Supported formats
.xlsx and .xls are supported via pandas read_excel
Engines: openpyxl for .xlsx, xlrd for .xls
Column naming variations
Phone columns: “phone”, “number”, “mobile”, “cell”, “tel” (case-insensitive)
Name columns: “name”, “contact”, “person” (case-insensitive)
Practical examples
Column names like “Mobile Number”, “Tel”, “Contact Person” are recognized
If none match, the algorithm uses the first column as phone and second as name (if present)
References:
Engines: requirements.txt
Keyword matching: column detection
Section sources
Common Issues with Excel File Processing#
Empty or malformed Excel files
pandas may raise errors; the extractor catches and returns empty results
Missing expected columns
The algorithm falls back to first/second columns; ensure data layout aligns with expectations
Mixed data types
Values are coerced to strings before cleaning; ensure phone numbers are readable text or numbers
Encoding and locale differences
The extractor does not enforce encoding; rely on pandas defaults
References:
Error handling: exception handling
Engine-related errors: EmptyDataError and KeyError examples
Section sources
The Excel extraction depends on pandas and its engines for reading Excel files.
Diagram sources
Section sources
Large Excel files
Reading and iterating rows scales linearly with the number of rows
Consider chunking or limiting rows for very large datasets
Keyword matching
Linear scan over columns; negligible overhead compared to IO
Memory usage
Entire DataFrame is loaded into memory; consider streaming alternatives for extremely large files
Engine choice
openpyxl is efficient for .xlsx; xlrd for .xls; ensure correct engine is installed
[No sources needed since this section provides general guidance]
Excel file not readable
Verify file format and engine installation
Confirm that the file is not password-protected or corrupted
Unexpected empty results
Check column names for expected keywords
Ensure phone numbers are present and not entirely blank
Phone number validation failures
Confirm the number meets digit count requirements after cleaning
Review separator characters and prefixes
References:
Engine installation: requirements.txt
Validation logic: clean_phone_number
Section sources
The Excel contact extraction pipeline leverages pandas to read .xlsx and .xls files, applies robust keyword-based column detection, and cleans phone numbers consistently. It gracefully handles exceptions and provides fallback behavior for corrupted or misformatted files. By aligning column names with supported keywords and ensuring proper engine installation, users can reliably extract contacts from Excel spreadsheets.